Abstract: Job scheduling for a MapReduce cluster has been
an active research topic in recent years. However, measurement
traces from real-world production environments show that the
durations of tasks within a job vary widely. The overall elapsed
time of a job, i.e., the so-called flowtime, is often dictated
by one or a few slow-running tasks within the job, generally
referred to as "stragglers". The causes of stragglers include
tasks running on partially or intermittently failing machines and
localized resource bottlenecks within a MapReduce cluster.
To tackle this online job scheduling challenge, we adopt the
task cloning approach and design scheduling algorithms, based on
the Shortest Remaining Processing Time (SRPT) scheduler, which aim
at minimizing the weighted sum of job flowtimes in a MapReduce cluster.
To be more specific, we first design a 2-competitive offline
algorithm for the case where the variance of task duration is negligible.
We then extend this offline algorithm to yield the so-called SRPTMS+C
algorithm for the online case and show that SRPTMS+C is
$(1+\epsilon)$-speed $o(1/\epsilon^2)$-competitive in reducing the weighted
sum of job flowtimes within a cluster. Both algorithms
explicitly consider the precedence constraints between the two
phases of the MapReduce framework. We also demonstrate
via trace-driven simulations that SRPTMS+C can significantly
reduce the weighted/unweighted sum of job flowtimes by cutting
down the elapsed time of small jobs substantially. In particular,
SRPTMS+C beats the Microsoft Mantri scheme by nearly 25%
according to this metric.

Index Terms: MapReduce, job scheduling, SRPT, cloning,
weighted job flowtime, competitive bound

I. Introduction 

MapReduce [8] and its open-source realization via Hadoop
[1] have emerged as the de facto framework to support large-scale
parallel/distributed processing and data analytics. Under
the MapReduce framework, the overall computation of a job
is decomposed into two separate phases, namely, the Map phase
and the Reduce phase. Within each phase, many relatively
small tasks are executed in parallel across a large number
of machines within the MapReduce cluster. The MapReduce
computational model also requires that the Reduce phase of a
job cannot begin until all the tasks within its Map phase have
been completed. A key feature catalyzing the widespread
adoption of the MapReduce framework is its ability to transparently
deal with the challenges of executing these tasks in a distributed
setting. One such fundamental challenge is the problem of
disproportionately long-running tasks, the so-called stragglers,
which correspond to tasks that are unfortunately assigned
to machines suffering from partial/intermittent failures or
localized resource bottlenecks. Measurement traces from
real-world production environments [4] indicate that stragglers
lead to a large variation in completion times among tasks in
the same job phase and delay job completion substantially.

The dominant technique to mitigate the straggler problem
is speculative execution: a strategy which preventively
or reactively handles stragglers by automatically launching
extra copies of a task on alternative machines. In particular,
there are two main classes of speculative execution strategies
proposed in the literature, namely, the Cloning approach [2]
and the Straggler-Detection-based approach (e.g., [28]).
Under the Cloning approach, extra copies of a task are
scheduled in parallel with the initial task and the one which
finishes first is used for the subsequent computation. Under the
Straggler-Detection-based approach, the progress of each task
is monitored by the system and backup copies are launched
when a straggler is detected. Unfortunately, most of these
speculative execution schemes are based on simple heuristics
and generally lack any performance guarantee.

To take a more systematic approach to the design of speculative
execution strategies, our previous work (e.g., [24]-[26])
proposes several optimization-based schemes. In particular, [26] proposes to
make clones for each task of the arriving jobs by running
a convex program which aims at minimizing the total job
elapsed time, i.e., the time span between the job arrival
and its completion. This is commonly referred to as the flowtime
of a job in the scheduling literature. However, [24]-[26]
still face two fundamental limitations. Firstly, the precedence
constraints between the two phases in the MapReduce framework
are ignored. Secondly, the complete distribution of task
duration within each job needs to be known in advance when
solving the optimization problem. Ideally, we want to take
the precedence constraint into consideration and reduce the
amount of information required for optimizing the speculative
execution scheme.

With the above ideas in mind, in this paper, we explicitly
model the precedence between the Map and Reduce phases and
assume that only the first and second moments of task duration
are known a priori. Similar to [26], we aim to minimize
the weighted sum of job flowtimes via task cloning. This
objective yields offline and online versions of the scheduling
problem, which turn out to be more difficult than the NP-hard
scheduling problem presented in [29]. Our main results
include approximation algorithms which are motivated by
the Shortest Remaining Processing Time (SRPT) scheduler
in both the offline and online settings. To be more specific, we
obtain a 2-competitive algorithm for the offline case when the
variance of task duration is negligible and a $(1+\epsilon)$-speed
$o(1/\epsilon^2)$-competitive algorithm for the online case where
$0 < \epsilon < 1$. For the online version of the algorithm, we assume
resource augmentation, which is necessary to circumvent
lower bounds for parallel scheduling on multiple machines.
Under the resource augmentation analysis, the adversary is
given $M$ unit-speed machines and our algorithm is given $M$
machines of speed $s$ where $s > 1$. To summarize, this paper
makes the following technical contributions:


• After reviewing the related work in Section II, we cast the
dynamic scheduling problem as a stochastic optimization
problem that focuses on finding a cloning scheme
to minimize the weighted sum of job flowtimes (Section III).

• Motivated by the SRPT scheduler, we design a 2-competitive
algorithm for the offline case when the variance
of task duration is negligible. Moreover, we show
that, with high probability, each job can complete within
a time span which is larger than that of the optimal algorithm
by only a constant factor times the standard deviation of
task duration (Section IV).

• Extending the offline algorithm, we design the so-called
SRPTMS+C algorithm for the online case. By
adopting the method of potential function analysis, we
prove that SRPTMS+C is $(1+\epsilon)$-speed $o(1/\epsilon^2)$-competitive
for the weighted sum of job flowtimes when
$0 < \epsilon < 1$ (Section V).

• Before concluding our work in Section VII, we demonstrate
via trace-driven simulations that SRPTMS+C can
significantly reduce the weighted average of job flowtimes
by cutting down the elapsed time of small jobs substantially.
In particular, SRPTMS+C beats the Microsoft
Mantri scheme by nearly 25% according to this metric
(Section VI).


II. Related work 

The straggler problem was first identified in the original
MapReduce paper [8]. Since then, various solutions have
been proposed to deal with it using the Straggler-Detection-based
speculative execution strategy (e.g., [28]). These
solutions mainly focus on promptly identifying stragglers and
accurately predicting the performance of running tasks. One
fundamental limitation is that detection may come too late to
help small jobs, as it needs to wait for the collection of
enough samples while monitoring the progress of tasks. To
avoid the extra delay caused by straggler detection, the cloning
approach was proposed in [2]. This approach relies on cloning
very small jobs in a greedy manner to mitigate the straggler
effect and is based on simple heuristics. In contrast, we
develop an optimization framework to make clones for each
arriving job. More recently, GRASS was proposed, which carefully
adopts the Detection-based approach to trim stragglers for
approximation jobs. GRASS also provides a unified solution
for normal jobs. However, one limitation is that it only
prioritizes the tasks within a job, and it remains a problem
to prioritize different jobs (i.e., the job-level scheduler is
neither optimized nor described).

Prior research on job scheduling for a MapReduce system
includes [6], [19], [22] and [29], among others. Some of these works
(e.g., [6]) derive performance bounds for minimizing the total completion
time. [22] designs the Coupling scheduler, which mitigates the
starvation problem caused by Reduce tasks in large jobs. Others
(e.g., [29]) extend the SRPT scheduler to minimize the total job
flowtime under different settings. However, all of these studies
assume accurate knowledge of task durations and hence do not
support speculative copies to be scheduled dynamically.

Finally, the SRPT scheduler has been studied extensively in the
traditional parallel scheduling literature. In particular, SRPT
has been proven to be $(1+\epsilon)$-speed $O(1/\epsilon)$-competitive for total
flowtime on $m$ identical machines in the single-task-per-job case.
In this paper, we extend the SRPT scheduler to yield an
online scheduler which can mitigate stragglers as well.

III. System Model and Problem Formulation 

Consider a MapReduce cluster which consists of $M$ machines.
A machine could represent a processor, a core or a
virtual machine. Assume a set of jobs $J = \{J_1, J_2, \cdots\}$
entering the cluster over time. Job $J_i \in J$, which
arrives at the cluster at time $a_i$, consists of $m_i$ map tasks
and $r_i$ reduce tasks. Each job has a weight $w_i$ which reflects
its priority. Let $J_i^m = \{\delta_i^{m,1}, \delta_i^{m,2}, \cdots, \delta_i^{m,m_i}\}$ and
$J_i^r = \{\delta_i^{r,1}, \delta_i^{r,2}, \cdots, \delta_i^{r,r_i}\}$ be the sets of map and reduce tasks
of $J_i$ respectively. Each machine can only hold one map or
reduce task at any time and all the machines are identical.

As described in Section I, the large variation in task completion
times is caused by machine failures or localized resource
bottlenecks. Instead of modelling the variance of machine
speed directly, we consider that the variation is caused by task
workload and that each machine processes all the tasks at the
same speed. Such a transformation preserves the variation
in task completion times and simplifies our analysis.

We assume time is slotted and a centralized scheduler
collects the status of jobs within the cluster at the beginning of
each time slot. If a machine runs a task at speed $s$, it will take
$p(\cdot)/s$ time slots to complete the task, where $p(\cdot)$ denotes the
workload of this task. Without loss of generality, we assume
that all the machines run at unit speed.

For ease of presentation, throughout the whole paper, we use
$c \in \{m, r\}$ to capture the map- or reduce-related statements
for all the tasks, i.e., when $c$ is used, it is fixed to either $m$ or $r$.
The workload of task $\delta_i^{c,j} \in J_i^c$ is $p_i^{c,j}$, where $p_i^{c,j}$ is a random
variable for all $i, j$. Under the unit-speed case, $p_i^{c,j}$ also denotes
the processing time of task $\delta_i^{c,j}$ on a particular machine. We
also assume the workloads of all tasks in the same phase of a job share the same
mean $E_i^c$ and standard deviation $\sigma_i^c$. The parameters $E_i^c$ and
$\sigma_i^c$ are known in advance to the scheduler for all $i$.

Table I summarizes all the notations in this model.

A. Speedup via task cloning 

In this model, we adopt the cloning approach to mitigate the 
negative impact of stragglers. Cloning helps to speed up the 



Table I

The notations of the scheduling parameters

| Notation | Corresponding meaning |
|---|---|
| $J$ | The set of jobs arriving at the cluster |
| $J_i^c$ | The set of map/reduce tasks of job $J_i$, with $c \in \{m, r\}$ ($m$ for map tasks; $r$ for reduce tasks) |
| $a_i$ | Arrival time of job $J_i$ |
| $f_i$ | Time when job $J_i$ completes its work |
| $w_i$ | Weight of job $J_i$ |
| $p_i^{c,j}$ | Workload of the map/reduce task $\delta_i^{c,j}$ |
| $E_i^c$ | The mean of the workload for a map/reduce task in $J_i$ |
| $\sigma_i^c$ | The standard deviation of the workload for a map/reduce task in $J_i$ |
| $M$ | Total number of machines in the cluster |
| $s_i^c(x)$ | The speedup function of a map/reduce task in $J_i$ |
| $c_i^j$ | Time when task $\delta_i^{c,j}$ is scheduled |
| $t_i^{c,j}$ | The duration of task $\delta_i^{c,j}$ |
| $f_i^{c,j}$ | Time when task $\delta_i^{c,j}$ completes |
| $M(t)$ | Number of machines running map tasks at time $t$ |
| $R(t)$ | Number of machines running reduce tasks at time $t$ |
| $x_i^{c,j}$ | Number of copies made for task $\delta_i^{c,j}$ |


completion of a task by using the copy of this task which finishes
first. To capture such speedup, we define a function
$s_i^c(x)$ for each phase of every single job, where $x$ is
the number of copies made for a particular task. For example,
it takes $p_i^{c,j}/s_i^c(2)$ time slots on average to complete task $\delta_i^{c,j}$
if two copies are made when scheduling $\delta_i^{c,j}$. Here, we assume
that $s_i^c(x)$ satisfies the following two properties:

• $s_i^c(x)$ is a concave and strictly increasing function of $x$,
$\forall i$.

• $s_i^c(1) = 1$ and $s_i^c(x) \le x$ for all $x > 0$, $\forall i$.

These two properties are applicable to most distributions of
the task duration observed in practice. For example, prior measurement
studies show that the task duration in a MapReduce cluster follows
a heavy-tail distribution. Below, we illustrate the concavity of
the speedup function when the task duration follows a Pareto
heavy-tail distribution. In particular, suppose the duration $t_i^{c,j}$ of task
$\delta_i^{c,j}$ follows the Pareto distribution

$$\Pr(t_i^{c,j} < t) = \begin{cases} 1 - (\mu/t)^{\alpha} & \text{for } t \ge \mu, \\ 0 & \text{otherwise.} \end{cases}$$

Then, when $r$ copies are made for the task $\delta_i^{c,j}$, the average duration
of $\delta_i^{c,j}$ is $\frac{\alpha r \mu}{\alpha r - 1}$. The derivation of this result is shown in [25].
As such, the speedup function is just $s_i^c(r) = \frac{\alpha r - 1}{r(\alpha - 1)}$, which
is strictly concave and monotonic.

B. A stochastic program formulation for job scheduling

For any job, all the map tasks and reduce tasks can only
be scheduled after the job arrives at the cluster and hence
$m_i^j \ge a_i$ and $r_i^j \ge a_i$. A map task finishes at
$f_i^{m,j} = m_i^j + t_i^{m,j}$ $\forall i$; $1 \le j \le m_i$, and the Map phase ends
when all the map tasks finish. Due to the precedence
constraints of the Map and Reduce phases, a reduce task cannot
begin its work while some map tasks within the same job are
unfinished. Thus, the reduce task can only start after the end of
the Map phase (i.e., at time $\max\{\max_k\{f_i^{m,k}\}, r_i^j\}$). At any time slot,
the total number of machines used for processing the tasks
and their clones cannot exceed $M$, i.e., $M(t) + R(t) \le M$.
Finally, a job completes when all the reduce tasks are finished,
i.e., $f_i = \max_j\{f_i^{r,j}\}$ $\forall i$; $1 \le j \le r_i$.

For this model, we aim to minimize the weighted sum of
job flowtimes by carefully making cloning decisions and prioritizing
different jobs. This formulation yields the optimization
problem shown below:

$$
\begin{aligned}
\min\quad & \sum_i w_i \cdot E[f_i - a_i] && && \text{(1a)} \\
\text{s.t.}\quad & m_i^j \ge a_i && \forall i;\ 1 \le j \le m_i && \text{(1b)} \\
& r_i^j \ge a_i && \forall i;\ 1 \le j \le r_i && \text{(1c)} \\
& E[t_i^{m,j}] = E_i^m / s_i^m(x_i^{m,j}) && \forall i;\ 1 \le j \le m_i && \text{(1d)} \\
& E[t_i^{r,j}] = E_i^r / s_i^r(x_i^{r,j}) && \forall i;\ 1 \le j \le r_i && \text{(1e)} \\
& f_i^{m,j} = m_i^j + t_i^{m,j} && \forall i;\ 1 \le j \le m_i && \text{(1f)} \\
& f_i^{r,j} = \max\{\max_k\{f_i^{m,k}\},\ r_i^j\} + t_i^{r,j} && \forall i;\ j && \text{(1g)} \\
& \sum_{i,j:\ m_i^j \le t < f_i^{m,j}} x_i^{m,j} = M(t) && \forall t && \text{(1h)} \\
& \sum_{i,j:\ r_i^j \le t < f_i^{r,j}} x_i^{r,j} = R(t) && \forall t && \text{(1i)} \\
& M(t) + R(t) \le M && \forall t && \text{(1j)} \\
& f_i = \max_j\{f_i^{r,j}\} && \forall i;\ 1 \le j \le r_i && \text{(1k)}
\end{aligned}
$$

Constraints (1d) and (1e) capture the speedup property for the
map tasks and reduce tasks respectively. Constraint (1g) is due
to the precedence constraint between the Map and Reduce phases.

Remark 1. When task cloning is not used and there is no
variation in completion times among tasks in the same job
phase, the scheduling problem in our model reduces to the
problem in [29]. However, the optimization problem presented
in [29] has been proven to be NP-hard even for the offline case
where all the jobs enter the cluster at the same time. The
stochastic optimization problem in Equation (1) is therefore
NP-hard and hence we resort to the use of approximation
algorithms to tackle this problem.
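
The Pareto example of Section III-A can be checked numerically. The following is a minimal sketch, assuming a Pareto distribution with shape `alpha > 1` and scale `mu` (both names are illustrative, not taken from the paper). It compares the closed-form speedup $(\alpha r - 1)/(r(\alpha - 1))$ with a Monte Carlo estimate in which the duration of a cloned task is the minimum over `r` independent draws.

```python
import random


def pareto_speedup(r: int, alpha: float) -> float:
    """Closed-form speedup for r >= 1 copies of a Pareto(alpha) task, alpha > 1."""
    return (alpha * r - 1.0) / (r * (alpha - 1.0))


def simulated_speedup(r: int, alpha: float, mu: float = 1.0,
                      trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate: mean single-copy duration / mean duration of the fastest of r copies."""
    rng = random.Random(seed)

    def draw() -> float:
        # Inverse-CDF sampling of Pareto(mu, alpha); 1 - random() lies in (0, 1].
        return mu * (1.0 - rng.random()) ** (-1.0 / alpha)

    single = sum(draw() for _ in range(trials)) / trials
    cloned = sum(min(draw() for _ in range(r)) for _ in range(trials)) / trials
    return single / cloned


if __name__ == "__main__":
    for copies in (1, 2, 3, 4):
        print(copies, pareto_speedup(copies, 3.0), simulated_speedup(copies, 3.0))
```

The closed form equals 1 for a single copy and grows with diminishing returns, which is the concave, sub-linear behaviour assumed for the speedup function.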


IV. Offline Scheduling: All the Jobs Arrive at the Cluster at the Same Time

Before designing the online algorithm, in this section, we
consider an offline case where all the jobs enter the system
at the same time, namely, $a_i = 0$ $\forall i$. We assume that all the
tasks cannot be launched simultaneously in the cluster (i.e.,
$\sum_i (m_i + r_i) > M$); otherwise, we can assign all the tasks to
the machines in the cluster and the scheduling process just
ends. Although this setting is simple, the offline algorithm
presented below provides good insights for the design of an
online algorithm in the following sections.

Prior work builds a simple model to analyze the advantage of pure
cloning. It concludes that cloning cannot help to reduce the
job flowtime if $s_i^c(x) < x$ when the number of tasks to be
scheduled is larger than $M$. Therefore, we do not clone extra
copies for the tasks in this bulk arrival scenario.























A. Offline Algorithm Design 

It is well known in the scheduling literature that the SRPT
scheduler is optimal for reducing overall flowtime on a single
machine when there is only one task per job. In each
time slot, the SRPT scheduler always selects the job with
the minimum total remaining workload to serve. We extend
this SRPT scheduler to design the following offline scheduling
algorithm:

In this algorithm, the scheduler first applies the SRPT
scheduler to the scheduling problem in which there is only
one single machine to determine the priority of each job. Let $\phi_i$
be the total effective workload of job $J_i$, which is determined
by the following equation:

$$\phi_i = m_i \cdot (E_i^m + r\sigma_i^m) + r_i \cdot (E_i^r + r\sigma_i^r) \qquad (2)$$

The standard deviation of task duration is incorporated into
the workload by multiplying it by a factor $r$, and the priority of
job $J_i$ is then defined as $w_i/\phi_i$. The rationale for including the
standard deviation of task duration in the effective workload
of a task is that tasks with large variation in completion times
can easily prolong the job completion and hence should be
scheduled later. However, it still remains a problem to choose
a good $r$ and we tackle this problem in Section VI.
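
As an illustration of Equation (2), the sketch below computes the effective workload and the resulting priority of a job. The `Job` record and its field names are hypothetical conveniences introduced here; only the formula itself comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Job:
    weight: float        # w_i
    n_map: int           # m_i, number of map tasks
    n_reduce: int        # r_i, number of reduce tasks
    mean_map: float      # E_i^m
    sd_map: float        # sigma_i^m
    mean_reduce: float   # E_i^r
    sd_reduce: float     # sigma_i^r


def effective_workload(job: Job, r: float) -> float:
    """Equation (2): per-task mean plus r standard deviations, summed over all tasks."""
    return (job.n_map * (job.mean_map + r * job.sd_map)
            + job.n_reduce * (job.mean_reduce + r * job.sd_reduce))


def priority(job: Job, r: float) -> float:
    """Priority w_i / phi_i; a larger value means the job is scheduled earlier."""
    return job.weight / effective_workload(job, r)


if __name__ == "__main__":
    job = Job(weight=2.0, n_map=4, n_reduce=2,
              mean_map=10.0, sd_map=1.0, mean_reduce=20.0, sd_reduce=2.0)
    print(effective_workload(job, r=3.0), priority(job, r=3.0))
```

With `r = 0` the priority reduces to weight over mean workload; increasing `r` pushes jobs with highly variable tasks later in the order.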


After computing the priority for each job, the jobs with
higher priorities are always scheduled before those with
lower priorities. Whenever a machine is available, the scheduler
randomly chooses one unscheduled task from the alive
(i.e., not-yet-finished) job that holds the
highest priority and assigns it to this machine. Moreover, all the
map tasks are scheduled before the reduce tasks in the same
job. Since the Reduce phase can only begin after the Map
phase finishes, a reduce task cannot make progress even after
it has been scheduled as long as there are some unfinished
map tasks within the same job.


Algorithm 1: Offline scheduling algorithm for the bulk arrival

Input: The jobs associated with $c_i$, $E_i^c$ and $\sigma_i^c$;
Output: Allocated machines for all the map tasks and reduce tasks.

Sort the job set $J$ in decreasing order of $w_i/\phi_i$;
Initialize the job set $\Phi = J$;
if a machine is available then
    for each job $J_i$ in $\Phi$ do
        if $J_i$ has an unscheduled map task then
            Choose one unscheduled map task at random and assign it to this machine;
        else
            Choose one unscheduled reduce task at random and assign it to this machine;
        if $J_i$ has no unscheduled task then
            $\Phi = \Phi - \{J_i\}$;


Algorithm 1 presents the pseudo-code of the algorithm.
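
To make the priority-ordered assignment concrete, here is a small simulation sketch of the offline scheduler. It is an illustration under simplifying assumptions that are not part of the paper: task durations are given as fixed numbers, tasks within a phase are taken in list order rather than at random, and each job is described by a plain dictionary with hypothetical keys `priority`, `map` and `reduce`.

```python
import heapq


def offline_schedule(jobs, num_machines):
    """Return the completion time of each job (in input order).

    jobs: list of dicts with 'priority' (w_i / phi_i), 'map' and 'reduce'
    (lists of task durations). All jobs arrive at time 0.
    """
    # Higher priority first, as in the sorting step of the algorithm.
    order = sorted(range(len(jobs)), key=lambda i: -jobs[i]["priority"])
    free_at = [0.0] * num_machines      # min-heap of machine release times
    heapq.heapify(free_at)
    finish = [0.0] * len(jobs)

    for i in order:
        map_end = 0.0
        for duration in jobs[i]["map"]:          # map tasks go first
            start = heapq.heappop(free_at)
            end = start + duration
            map_end = max(map_end, end)
            heapq.heappush(free_at, end)
        job_end = map_end
        for duration in jobs[i]["reduce"]:
            start = heapq.heappop(free_at)
            # A reduce task holds its machine but only progresses
            # once every map task of the same job has finished.
            end = max(start, map_end) + duration
            job_end = max(job_end, end)
            heapq.heappush(free_at, end)
        finish[i] = job_end
    return finish


if __name__ == "__main__":
    jobs = [
        {"priority": 1.0, "map": [3.0], "reduce": [1.0]},
        {"priority": 2.0, "map": [2.0, 2.0], "reduce": [1.0]},
    ]
    print(offline_schedule(jobs, num_machines=2))
```

In this example the second job has the higher priority, so it occupies both machines first and completes at time 3, while the first job completes at time 6.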

B. Deriving the upper bound for job flowtime 

We proceed to analyze the performance of Algorithm 1.
Define $\beta_i$ as the accumulated workload of those jobs whose
priority is no lower than that of job $J_i$. In other words,

$$\beta_i = \sum_{j:\ w_j/\phi_j \ge w_i/\phi_i} \phi_j \qquad (3)$$

We aim to show a generic bound on the flowtime of each job
with a certain probability. To achieve this goal, we first prove
the following lemma:

Lemma 1. With probability at least $\frac{r^2 - 1}{r^2}$, the cluster is
processing the jobs with priority at least $w_i/\phi_i$ during the
interval $[0, f_i - E_i^r - r\sigma_i^r]$.

Refer to Appendix A for the detailed proof of Lemma 1.
Based on Lemma 1, we derive the following theorem which
provides an upper bound for the flowtime of each job:

Theorem 1. The flowtime of job $J_i$ is bounded by $E_i^r + r\sigma_i^r +
\beta_i/M$ with probability at least $1 + 1/r^4 - 2/r^2$.

Refer to Appendix B for the proof of Theorem 1.

Remark 2. When the variance of the task workload is zero,
the flowtime of each job is bounded by $E_i^r + \beta_i/M$ under
Algorithm 1. Regardless of the type of scheduler, the flowtime
of each job must be at least $E_i^r$. On the other hand, the
performance of the optimal scheduler is no better than that of the
SRPT scheduler with one machine in terms of the weighted sum
of job flowtimes. Under the SRPT scheduler with one machine,
the flowtime of each job is just $\beta_i/M$. Hence, we conclude
that Algorithm 1 achieves a constant competitive ratio of two.

However, if the variance is non-negligible, Algorithm 1
cannot achieve a constant competitive ratio but still provides
an upper bound for the flowtime of each individual job.
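
The argument of Remark 2 can be written as a short chain of inequalities. This is only a sketch for the zero-variance, bulk-arrival case; the notation $f_i^{*}$ (flowtime of job $J_i$ under an optimal schedule) and $\mathrm{OPT} = \sum_i w_i f_i^{*}$ is introduced here for illustration, $f_i$ is the flowtime under Algorithm 1, and $\beta_i$ is the accumulated workload defined before Lemma 1.

```latex
\begin{aligned}
\sum_i w_i f_i
  &\le \sum_i w_i \left( E_i^r + \frac{\beta_i}{M} \right)
     && \text{(bound of Remark 2 with zero variance)} \\
  &= \sum_i w_i E_i^r + \sum_i w_i \frac{\beta_i}{M} \\
  &\le \sum_i w_i f_i^{*} + \sum_i w_i f_i^{*}
     && \text{(each term is a lower bound on } \mathrm{OPT}\text{)} \\
  &= 2\,\mathrm{OPT}.
\end{aligned}
```

The first lower bound uses $f_i^{*} \ge E_i^r$ for every job, and the second uses the statement in Remark 2 that the optimal schedule cannot beat the single-machine SRPT schedule, whose per-job flowtime is $\beta_i/M$.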

V. Online Scheduling with Cloning for Job Arrival over Time

In this section, we first present an approximation algorithm
for the online scheduling case where the jobs arrive at the
cluster over time. After that, we provide an upper bound for
the competitive ratio of the proposed algorithm.

A. Shortest Remaining Processing Time based Machine Sharing Principle

Extending the offline algorithm presented in Section IV,
we design an SRPT-based machine sharing algorithm for
this online case. The principle of machine sharing is motivated
by the work in [9], [10], [12], where the machines are shared
among the latest jobs arriving at the cluster (LAPS). A classic
result in [10] shows that the LAPS algorithm is scalable for
minimizing the total flowtime of jobs with sublinear speedup
functions on multiple machines. However, in our algorithm, we
share the machines among the jobs with the smallest remaining
workload. In addition, we also make clones for the tasks
according to machine availability.











We call this approximation algorithm the Shortest Remaining
Processing Time based Machine Sharing plus Cloning
(SRPTMS+C) algorithm. At a high level, the SRPTMS+C algorithm
works as follows: At the beginning of each time slot, the
scheduler computes a priority for every alive job (i.e.,
not-yet-finished job). Let $\epsilon$ be a number such that $0 < \epsilon < 1$. Jobs
with the highest priorities share the machines in proportion to
their weights so that the weight of all running jobs is an $\epsilon$
fraction of the total weight of the alive jobs in the system.
Observe that when $\epsilon$ is set to 1, the scheduler reduces
to the fair scheduler in Hadoop. On the other hand, if
$\epsilon$ is close to 0, the scheduler becomes the SRPT scheduler.
By tuning the parameter $\epsilon$, we can obtain a scheduler that
best fits a cluster. More importantly, this $\epsilon$-fraction sharing
principle yields a bounded competitive ratio, as presented in
the following sections.

Let $\psi^s(l)$ be the set of alive jobs at the beginning of time
slot $l$. Denote by $m_i(l)$ and $r_i(l)$ the number of unscheduled
map and reduce tasks of job $J_i$ respectively. The remaining
effective workload of job $J_i$ can be characterized by:

$$U_i(l) = m_i(l) \cdot (E_i^m + r\sigma_i^m) + r_i(l) \cdot (E_i^r + r\sigma_i^r) \qquad (4)$$

The scheduler computes $w_i/U_i(l)$ for each job in $\psi^s(l)$ and
guarantees that the jobs with larger $w_i/U_i(l)$ have higher priorities
to be scheduled. Let

$$W(l) = \sum_{J_i \in \psi^s(l)} w_i \qquad (5)$$

and let $\psi_i^s(l)$ be the set of jobs alive for SRPTMS+C at time slot
$l$ which have lower priorities (i.e., smaller $w_j/U_j(l)$) than $J_i$. $J_i$ is
also included in $\psi_i^s(l)$. Define $W_i(l) = \sum_{J_j \in \psi_i^s(l)} w_j$ and let

$$
g_i(l) =
\begin{cases}
\dfrac{w_i \cdot M}{\epsilon W(l)} & W_i(l) - w_i \ge (1-\epsilon) W(l) \\[2ex]
0 & W_i(l) \le (1-\epsilon) W(l) \\[1ex]
\dfrac{\left(W_i(l) - (1-\epsilon) W(l)\right) \cdot M}{\epsilon W(l)} & \text{otherwise}
\end{cases}
$$

Each job $J_i \in \psi^s(l)$ shares $g_i(l)$ machines within the
cluster, including those that are still running the tasks
of $J_i$, whose number is denoted by $\sigma_i(l)$. Hence, the number of
machines newly assigned to $J_i$ in time slot $l$ is $(g_i(l) - \sigma_i(l))$.
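
The sharing rule can be sketched in a few lines of code. This is a minimal illustration assuming each alive job is given as a `(weight, remaining_effective_workload)` pair with a positive workload; the function name and the input layout are hypothetical, and the returned shares are fractional because the text does not specify a rounding rule.

```python
def machine_shares(jobs, num_machines, eps):
    """Return g_i(l) for every alive job under the eps-fraction sharing rule.

    jobs: list of (w_i, U_i(l)) pairs with U_i(l) > 0; 0 < eps <= 1.
    """
    total_weight = sum(w for w, _ in jobs)
    threshold = (1.0 - eps) * total_weight
    # Visit jobs from lowest to highest priority w_i / U_i(l).
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0] / jobs[i][1])
    shares = [0.0] * len(jobs)
    lower_weight = 0.0  # total weight of jobs with lower priority than the current one
    for i in order:
        w = jobs[i][0]
        w_i_l = lower_weight + w  # W_i(l): includes the job itself
        if w_i_l - w >= threshold:
            # Fully inside the top eps fraction: share proportional to weight.
            shares[i] = w * num_machines / (eps * total_weight)
        elif w_i_l <= threshold:
            # Entirely outside the top eps fraction: no machines.
            shares[i] = 0.0
        else:
            # Straddles the boundary: only the part above the threshold counts.
            shares[i] = (w_i_l - threshold) * num_machines / (eps * total_weight)
        lower_weight = w_i_l
    return shares


if __name__ == "__main__":
    alive = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
    print(machine_shares(alive, num_machines=10, eps=0.5))
```

By construction the shares always add up to the total number of machines, and a smaller `eps` concentrates them on fewer high-priority jobs.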

B. Task-Cloning Algorithm Design 

When allocating the number of machines for each job (i.e.,
$g_i(l) - \sigma_i(l)$), there may exist one case which violates the
basic sharing principle in Section V-A, namely, the number of
machines running the tasks of $J_i$ (i.e., $\sigma_i(l)$) already exceeds
$g_i(l)$ for some $i$. In such a situation, the scheduler preserves
the work already completed for job $J_i$ and just keeps running the tasks
of $J_i$ with their clones on $\sigma_i(l)$ machines. In other words, the
scheduler does not allow preemption and lets $J_i$ occupy these
extra machines. Due to this non-preemptive mechanism, the
exact number of machines shared by job $J_i$ may be larger
than $g_i(l)$.

After the number of machines is allocated for each job, the 
scheduler needs to choose appropriate tasks of the alive jobs 


Algorithm 2: SRPTMS+C algorithm design for online scheduling

Update $\psi^s(l)$, the set of jobs which have unscheduled tasks at the current time slot;
Update the number of available machines $M(l)$;
Compute $U_i(l)$ for each $J_i \in \psi^s(l)$ and sort the jobs in decreasing order of $w_i/U_i(l)$;
Compute $W(l)$ based on Equation (5);
for each job $J_i \in \psi^s(l)$ do
    Compute $g_i(l)$, the number of machines $J_i$ deserves according to the $\epsilon$-fraction sharing policy;
for each job $J_i \in \psi^s(l)$ with $g_i(l) > 0$ do
    Count the number of machines which still run the tasks of $J_i$, including all the clones, and denote it by $\sigma_i(l)$;
    Compute the number of newly available machines, which is $\delta(l) = g_i(l) - \sigma_i(l)$;
    if $\delta(l) \le 0$ then
        continue;
    if $\delta(l) \le M(l)$ then
        Assign $\delta(l)$ extra machines to $J_i$;
        Call the task scheduling procedure for $J_i$ with $\delta(l)$ machines, with returning value $\pi_i(l)$;
        $M(l)$ -= $\pi_i(l)$;
    if $\delta(l) > M(l)$ then
        Assign $M(l)$ extra machines to $J_i$;
        Call the task scheduling procedure for $J_i$ with $M(l)$ machines, with returning value $\pi_i(l)$;
        $M(l)$ -= $\pi_i(l)$;
return;


Procedure: Task scheduling for job $J_i$ with $x$ newly allocated machines

Input: The number of newly allocated machines $x$ and the running status;
Output: Task scheduling decision for $J_i$ and returning value $\pi_i(l)$.

Count $m_i(l)$ and $r_i(l)$, the number of unscheduled map tasks and reduce tasks for $J_i$ respectively;
if $m_i(l) > 0$ && $m_i(l) < x$ then
    Run $[x/m_i(l)]$ copies for each unscheduled map task on available machines;
    return $x - [x/m_i(l)] \cdot m_i(l)$;
else if $m_i(l) > 0$ && $m_i(l) \ge x$ then
    Choose $x$ unscheduled map tasks uniformly at random and run one copy for each task on available machines;
    return 0;
else
    Repeat the same scheduling process for reduce tasks with $x$ allocated machines.

















for scheduling and make cloning decisions carefully. Following
the precedence constraint of the Map and Reduce phases,
the scheduler begins to schedule reduce tasks only after all the map
tasks have been scheduled. In addition, clones are made for the tasks
depending on whether the number of machines allocated to a
particular job is larger than the number of its unscheduled tasks.
Take job $J_i$ for example: when $g_i(l) - \sigma_i(l) > c_i(l)$, cloning
is performed to fully utilize the machines allocated to $J_i$.
To be more specific, the scheduler spawns the same number of
clones for all the unscheduled tasks in $J_i^c(l)$; otherwise, tasks
with fewer clones are more likely to lag behind. Thus, 
$[(g_i(l) - \sigma_i(l))/c_i(l)]$ copies$^1$ are made for each
unscheduled task of $J_i$. In contrast, when $g_i(l) - \sigma_i(l) \le c_i(l)$, following
the same argument as for the offline scheduling algorithm, clones
are not made. Hence, the scheduler chooses some
unscheduled tasks from $J_i^c(l)$ at random and launches them without
cloning.

Algorithm 2 presents the pseudo-code of the algorithm.
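
The cloning decision for a single job can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the rounding of the number of copies is assumed to be rounding down, and the returned leftover is the number of allocated machines that remain unused.

```python
def clone_plan(num_unscheduled: int, x: int):
    """Return (copies_per_task, tasks_launched, leftover_machines).

    num_unscheduled: unscheduled tasks in the current phase of the job.
    x: machines newly allocated to the job in this time slot.
    """
    if num_unscheduled <= 0 or x <= 0:
        return (0, 0, max(x, 0))
    if x > num_unscheduled:
        # More machines than tasks: every task gets the same number of copies.
        copies = x // num_unscheduled  # assumption: round down
        return (copies, num_unscheduled, x - copies * num_unscheduled)
    # Not enough machines to clone: launch x tasks with one copy each.
    return (1, x, 0)


def schedule_job(unscheduled_map: int, unscheduled_reduce: int, x: int):
    """Map tasks are always handled before reduce tasks of the same job."""
    if unscheduled_map > 0:
        return ("map",) + clone_plan(unscheduled_map, x)
    return ("reduce",) + clone_plan(unscheduled_reduce, x)


if __name__ == "__main__":
    print(schedule_job(3, 2, 10))  # plenty of machines: clone the map tasks
    print(schedule_job(0, 5, 3))   # few machines: three reduce tasks, no clones
```

Giving every unscheduled task the same number of copies follows the rationale in the text that tasks with fewer clones are more likely to lag behind.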

C. Resource augmentation analysis 

In this section, we use resource augmentation to analyze
the performance of the SRPTMS+C algorithm. Before going
into the details of the analysis, we first present the following
definition which characterizes the performance of an approximation
algorithm.

Definition 1. An approximation algorithm is $s$-speed $c$-competitive
if the algorithm's objective is within a factor of $c$
of the optimal solution's objective when the algorithm is given
$s$ resource augmentation.

Proposition 1. Consider any continuous and concave function
$f : R^+ \to R^+$ with $f(0) \ge 0$. Then for any $b \ge a > 0$, we
have $\frac{f(a)}{a} \ge \frac{f(b)}{b}$.

Proof: According to the definition of a concave function, it
holds that $f(\lambda x + (1-\lambda) y) \ge \lambda f(x) + (1-\lambda) f(y)$ for any
$x, y \in R^+$ and $\lambda \in [0, 1]$. In particular, consider $x = 0$, $y = b$
and $\lambda = 1 - \frac{a}{b}$. Then we have $f(a) \ge (1 - \frac{a}{b}) f(0) + \frac{a}{b} f(b) \ge
\frac{a}{b} f(b)$. Q.E.D.
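
A quick numerical sanity check of Proposition 1 is given below. It is only an illustrative sketch: the helper name is hypothetical, and the test functions (square root, `log1p`, and a Pareto-type speedup with shape 3) are examples chosen here rather than taken from the paper.

```python
import math


def ratio_is_nonincreasing(f, points, tol=1e-12):
    """Check that f(x)/x does not increase along the sorted positive points."""
    ratios = [f(p) / p for p in sorted(points)]
    return all(left >= right - tol for left, right in zip(ratios, ratios[1:]))


if __name__ == "__main__":
    grid = [0.5 * k for k in range(1, 41)]
    print(ratio_is_nonincreasing(math.sqrt, grid))           # concave, f(0) = 0
    print(ratio_is_nonincreasing(math.log1p, grid))          # concave, f(0) = 0
    print(ratio_is_nonincreasing(lambda x: x * x, grid))     # convex: check fails
```

For concave functions with a non-negative value at zero, the ratio `f(x)/x` shrinks as `x` grows, which is exactly the inequality used in the proof; the convex counterexample shows that the check is not vacuous.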

Remark 3. Based on Proposition |7j we conclude that /(| • 
a) > jtf(a), this can be proved by substituting x = | • a and 
y = a into the inequality. 

With the help of Proposition 1, we derive the following
theorem which provides an upper bound for the competitive
ratio of SRPTMS+C.

Theorem 2. The algorithm SRPTMS+C is $(1+\epsilon)$-speed
$o(1/\epsilon^2)$-competitive for the expectation of the weighted sum of
job flowtimes when $0 < \epsilon < 1$.

The method of potential function analysis is widely adopted
to derive performance bounds with resource augmentation for
online parallel scheduling algorithms in the literature (e.g., [9],
[10], [13], [23]). The key step of this method is to define a
proper potential function which combines the adversary and

1 [x] denotes the rounding of the real number x.


Table II

Google trace data statistics

| Statistic | Value |
|---|---|
| Total number of jobs | 6064 |
| Trace duration (s) | 35032 |
| Average number of tasks per job | 26.31 |
| Minimum task duration (s) | 12.8 |
| Maximum task duration (s) | 22919.3 |
| Average task duration (s) | 1179.7 |


our algorithm. To be more specific, let $A(t)$ and $OPT(t)$
be the accumulated weighted sum of job flowtimes in the
algorithm's and adversary's schedules, respectively. We define
a potential function $\Phi(t)$ that satisfies the following properties,
which are extended from prior work:

• Boundary Condition: $\Phi(0) = \Phi(\infty) = 0$.

• Changes Condition when a job arrives or completes: the
value of the potential function decreases or remains the
same when a job arrives or completes in our algorithm
and the adversary.

• Dynamic Changes Condition: with $\epsilon$ resource augmentation,
at any time when no job arrives or completes,
$\frac{dA(t)}{dt} + \frac{d\Phi(t)}{dt} \le o\left(\frac{1}{\epsilon^2}\right) \cdot \frac{dOPT(t)}{dt}$.

By integrating over time, one can see that the existence of
such a potential function is sufficient to yield a $(1+\epsilon)$-speed
$o(1/\epsilon^2)$-competitive algorithm. Refer to Appendix C for the
detailed proof.

VI. Performance Evaluation 

In this section, we evaluate the performance of the
SRPTMS+C algorithm via extensive simulations driven by
Google cluster-usage traces. The traces contain the
information of job submission and completion times of Google
services on a cluster of 12K servers. They also include the
number of tasks as well as the duration of each task. In
addition, the priority of each job ranges from 0 to 11 and we
simply treat this priority as the job weight. From the traces, we
extract the statistics of more than 6000 jobs during a 12-hour
period. We exclude those jobs which have specific
constraints on machine attributes. The detailed job statistics
are given in Table II.

When running the simulations, we estimate the distribution 
for the workloads of all the tasks within each job phase. Once 
a cloning copy is made for a particular task, the workload 
for this clone is just drawn independently from the estimated 
distribution. We repeat the same simulation for each of the 
following evaluations ten times and take the average to obtain 
the final result. 
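
The sampling step of this simulation setup can be sketched as below. This is an assumption-laden illustration: the estimated distribution is represented here by resampling from a list of observed durations (the text does not say how the distribution is estimated), and all function names are hypothetical.

```python
import random
from statistics import mean


def clone_duration(observed, copies, rng):
    """Duration of a task run with `copies` copies.

    Each copy draws its workload independently from the observed durations
    (assumed stand-in for the estimated distribution); the first copy to
    finish determines the task duration.
    """
    return min(rng.choice(observed) for _ in range(copies))


def averaged(run_once, repetitions=10, seed=0):
    """Repeat a stochastic simulation and average its scalar result."""
    return mean(run_once(random.Random(seed + k)) for k in range(repetitions))


if __name__ == "__main__":
    observed = [12.8, 100.0, 1179.7]  # illustrative durations in seconds
    for copies in (1, 2, 4):
        print(copies, averaged(lambda rng: clone_duration(observed, copies, rng)))
```

Using a separate seeded generator for each repetition keeps the ten runs independent while making the averaged result reproducible.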

A. Baseline Algorithms for comparison 

We adopt the following two algorithms as the baselines to 
compare with the SRPTMS+C algorithm: 

• Microsoft Mantri’s Speculative Execution Scheme:
The speculative execution scheme of Mantri is demonstrated
to be the most effective one among all the
straggler-detection based schemes. Mantri estimates
the remaining time to finish, $t_{rem}$, for each task and
predicts the required execution time of a relaunched copy
of the task, $t_{new}$. Once a machine becomes available,
the system makes a decision on whether to launch a
backup copy based on the statistics of $t_{rem}$ and $t_{new}$.
Specifically, another copy is launched if the inequality
$P(t_{rem} > 2 \cdot t_{new}) > \delta$ holds.

• Smart Cloning Algorithm (SCA): SCA is a cloning
algorithm proposed in [26]. At the beginning
of each time slot, SCA first runs a convex program to
determine the number of copies assigned to each task
and then launches all the copies simultaneously on available
machines. SCA has been demonstrated to cut down the
elapsed time of small jobs substantially.

Figure 1. The weighted/unweighted average of
job flowtimes for different ε under the SRPTMS+C
algorithm when r = 0.

Figure 2. The weighted/unweighted average of
job flowtimes for different r under the SRPTMS+C
algorithm when ε = 0.6.

Figure 3. The weighted/unweighted average of
job flowtimes under different number of machines
for SRPTMS+C when ε = 0.6 and r = 3.
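The backup-launch test of the Mantri baseline described above can be illustrated with a short sketch. How Mantri actually estimates the probability is not described in this section, so the code below assumes, purely for illustration, that samples of the remaining time and of the relaunch time are available and treated as independent. The function name and the default threshold value are hypothetical.

```python
def should_launch_backup(t_rem_samples, t_new_samples, delta=0.25):
    """Return True if the empirical estimate of P(t_rem > 2 * t_new) exceeds delta.

    The probability is estimated over all (t_rem, t_new) sample pairs, which
    treats the two quantities as independent (an assumption of this sketch).
    """
    if not t_rem_samples or not t_new_samples:
        raise ValueError("both sample lists must be non-empty")
    pairs = len(t_rem_samples) * len(t_new_samples)
    hits = sum(1 for rem in t_rem_samples for new in t_new_samples if rem > 2 * new)
    return hits / pairs > delta
```

The factor of 2 reflects the rule stated above: a backup copy is only worthwhile when the running copy is likely to need more than twice the time of a fresh one.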

Instead of comparing the weighted sum of job flowtimes
directly, we report the weighted average for ease of presentation.
Moreover, we also compare the unweighted average as well
as the cumulative distribution function (CDF) of job
flowtimes across the different algorithms. Each time
slot lasts 1 second in our simulations.
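The three reported metrics can be computed as in the sketch below. The normalization of the weighted average is not spelled out in the text; the code assumes the weighted sum of flowtimes is divided by the total weight. Function names are hypothetical.

```python
def weighted_average_flowtime(flowtimes, weights):
    """Weighted sum of flowtimes divided by total weight (assumed normalization)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(w * f for w, f in zip(weights, flowtimes)) / total_weight

def unweighted_average_flowtime(flowtimes):
    """Plain mean of the job flowtimes."""
    return sum(flowtimes) / len(flowtimes)

def cdf_at(flowtimes, threshold):
    """Empirical CDF: fraction of jobs whose flowtime is at most `threshold`."""
    return sum(1 for f in flowtimes if f <= threshold) / len(flowtimes)
```

For example, `cdf_at(flowtimes, 100)` gives the fraction of jobs completing within 100 seconds, which is the kind of quantity read off the CDF plots.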

B. The impact of ε, r and the number of machines in the
cluster

In this subsection, we first evaluate the impact of ε and
r on the average weighted/unweighted job flowtime under
the SRPTMS+C algorithm in the cluster that contains 12K
machines. Fig. 1 depicts the evaluation result under different ε
when r = 0. Observe that both metrics attain their minimum
when ε = 0.6, which corresponds
to a scheduler that schedules nearly half of the alive jobs
with smaller effective workloads in each time slot.

To further evaluate the impact of r on the cluster performance,
we set ε to 0.6 and evaluate the weighted/unweighted
average of job flowtimes for different r under the SRPTMS+C
algorithm. Fig. 2 shows that the unweighted average of job
flowtimes attains its minimum when r = 3 while the weighted
average reaches its minimum at r = 8. In fact, neither
metric varies much across different r. The
major reason is that the variation of task duration within each
job phase is small for this particular job trace.

On the other hand, we vary the number of machines
in the cluster to show the impact on the job flowtime.
Observe from Fig. 3 that when the number of machines is
around 8K, the performance is as good as in the
original cluster with 12K machines. There are enough resources
to make clones for small jobs under the SRPTMS+C algorithm
even though the cluster only has 8K machines. The flowtime of
small jobs from this trace is therefore reduced substantially under
SRPTMS+C.

C. Comparison against baseline algorithms 

Based on the evaluation results above, we choose ε to be 0.6
and r to be 3 for the SRPTMS+C algorithm. We implement the
baseline algorithms as presented in their original papers
in the cluster that contains 12K machines. The comparison
results are illustrated in Fig. 4 and Fig. 5. Fig. 4 depicts
the CDF of flowtime for the small jobs whose flowtime is
between 0 and 300 seconds. It indicates that the SRPTMS+C
algorithm obtains the best performance for those small jobs.
Under SRPTMS+C, more than 50% of jobs complete within 100
seconds. In contrast, about 46% and 44% of jobs complete within
100 seconds under SCA and Mantri respectively.

Fig. 5 depicts the CDF of flowtime for the big jobs whose
flowtime is between 300 and 4000 seconds. One can see that
the SRPTMS+C algorithm still achieves the best performance
for these big jobs. For instance, about 90% of jobs complete
within 1000 seconds under SRPTMS+C while only 88% and
86% of jobs complete within this time-span under SCA and
Mantri respectively.

We illustrate the weighted/unweighted average of job flowtimes
for this trace under different algorithms in Fig. 6. It
shows that both metrics under SRPTMS+C are
reduced by nearly 25% compared with the Mantri baseline scheme.
More importantly, the SRPTMS+C algorithm is much more
efficient than the Mantri scheme in terms of implementation,
as the latter needs to monitor the progress of each running
task, which requires extra system instrumentation.

Figure 4. The cumulative fraction of the jobs
within the flowtime ranging from 0 to 300 seconds
under different algorithms.

Figure 5. The cumulative fraction of the jobs
within the flowtime ranging from 500 to 4000
seconds under different algorithms.

Figure 6. The weighted/unweighted average of
job flowtimes under different algorithms within the
cluster that has 12K machines.

VII. Conclusions

In this paper, we study the online scheduling problem in
a MapReduce cluster and formulate a stochastic optimization
program with the objective of minimizing the weighted sum of
job flowtimes. Following this model, we design
straggler-mitigation algorithms via task cloning, which are motivated
by the SRPT scheduler, in both offline and online cases.
In the offline case, we show that, with high probability,
each job can complete under our algorithm within a time-span which is larger
than that of the optimal scheduling algorithm by only a constant
factor times the standard deviation of task duration.
When the variance of task duration is negligible,
the offline algorithm achieves a competitive ratio of 2. On
the other hand, we present the SRPTMS+C algorithm for the
online case and provide an upper bound for the competitive
ratio through the potential function analysis. Finally, we run
several trace-driven simulations to evaluate the performance
of the SRPTMS+C algorithm. The results show that SRPTMS+C cuts
down the flowtime of small jobs substantially and reduces
the weighted/unweighted sum of job flowtimes by nearly 25%
compared with the Mantri baseline scheme.

Appendix

A. Proof of Lemma 1

Proof: Consider the case where the cluster is
processing some jobs with priorities smaller than $w_i/\phi_i$ during
the interval $[0, f_i - E_i^r - r\sigma_i^r]$. In this case, the job $J_i$ must
have already scheduled all the reduce tasks at time $f_i - E_i^r - r\sigma_i^r$
according to the offline algorithm we present. Further, we
consider the reduce task which finishes last in job $J_i$
and let $\delta_i^j$ denote it. According to the definition of $f_i$, $\delta_i^j$
finishes its work at time $f_i$. It implies that the workload of
$\delta_i^j$, written $p_i^j$, is at least $E_i^r + r\sigma_i^r$. Applying the Chebyshev Inequality
here, we get the following formula:

$$\Pr\{p_i^j \ge E_i^r + r\sigma_i^r\} \le \Pr\{|p_i^j - E_i^r| \ge r\sigma_i^r\} \le \frac{1}{r^2} \quad (6)$$

This completes the proof. ■
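As a numerical illustration of how loose or tight the Chebyshev step can be, the sketch below compares the bound 1/r² with the exact upper tail of an exponentially distributed workload. The exponential distribution is an assumption chosen only because its tail has a closed form; the proof itself makes no such assumption.

```python
import math

def chebyshev_bound(r):
    """Chebyshev upper bound on P(|X - mean| >= r * std)."""
    if r <= 0:
        raise ValueError("r must be positive")
    return 1.0 / (r * r)

def exponential_upper_tail(r):
    """Exact P(X >= mean + r * std) for an exponential X.

    For an exponential distribution the standard deviation equals the mean,
    so the event is X >= (1 + r) * mean, whose probability is exp(-(1 + r)).
    """
    return math.exp(-(1.0 + r))
```

For any positive r the exact tail stays below the Chebyshev bound, which is why the lemma holds without knowing the workload distribution.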

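B. Proof of Theorem 1

Proof: Denote by X the work that the cluster has processed
for the jobs with priority at least $w_i/\phi_i$. Thus, the
following two equations hold:

$$E[X] = \sum_{j:\, w_j/\phi_j \ge w_i/\phi_i} \left( m_j \cdot E_j^m + r_j \cdot E_j^r \right) \quad (7)$$

$$\sigma^2[X] = \sum_{j:\, w_j/\phi_j \ge w_i/\phi_i} \left( m_j \cdot (\sigma_j^m)^2 + r_j \cdot (\sigma_j^r)^2 \right) \quad (8)$$

Applying the Chebyshev Inequality again, we conclude that
the probability that X is no less than $\beta$ is bounded by

$$\Pr\{X \ge \beta\} \le \Pr\Big\{|X - E[X]| \ge r \sum_{j:\, w_j/\phi_j \ge w_i/\phi_i} (m_j \sigma_j^m + r_j \sigma_j^r)\Big\} \le \frac{\sum_{j:\, w_j/\phi_j \ge w_i/\phi_i} \left[ m_j (\sigma_j^m)^2 + r_j (\sigma_j^r)^2 \right]}{\Big[ r \sum_{j:\, w_j/\phi_j \ge w_i/\phi_i} (m_j \sigma_j^m + r_j \sigma_j^r) \Big]^2} \le \frac{1}{r^2} \quad (9)$$

Applying Lemma 1 here, with probability at least $(r^2 - 1)/r^2$, the
cluster is processing the work of X during the interval
$[0, f_i - E_i^r - r\sigma_i^r]$. There are M machines processing the tasks with
unit speed in total. Thus, with probability at least $(r^2 - 1)^2/r^4$, the
following inequality holds:

$$M \cdot (f_i - E_i^r - r\sigma_i^r) \le \beta \quad (10)$$

The theorem immediately follows. Q.E.D.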
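To get a feel for the probability guarantee, the sketch below evaluates it exactly for a few values of r. It only restates the arithmetic of the proof: the lemma's guarantee 1 − 1/r² is applied twice, giving (r² − 1)²/r⁴. Function names are hypothetical.

```python
from fractions import Fraction

def lemma_probability(r):
    """Lower bound 1 - 1/r^2 obtained from one application of Chebyshev."""
    return 1 - 1 / Fraction(r) ** 2

def theorem_probability(r):
    """Lower bound (r^2 - 1)^2 / r^4 obtained by combining the two events."""
    return lemma_probability(r) ** 2
```

For instance, r = 3 gives 64/81, roughly 0.79, and the guarantee approaches 1 as r grows, at the cost of a looser bound on the completion time.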

C. Proof of Theorem 2


In the algorithm design of Section V-A we assume that
time is slotted. For convenience of analysis, here we consider
a more general case where time is continuous. In fact, if we
make the length of a time slot small enough that
the duration of each task is a multiple of the slot length, our
analysis does not violate the algorithm setting.

Proof: Let $y_i^j(t) = \max\{p_i^{s,j}(t) - p_i^{o,j}(t), 0\}$, where
$p_i^{o,j}(t)$ and $p_i^{s,j}(t)$ denote the remaining workload to be
processed for task $\delta_i^j$ in Job $J_i$ at time t under the optimal
scheduling policy and the SRPTMS+C algorithm respectively. Let
$\psi^o(t)$ and $\psi_i^o(t)$ be the jobs and tasks that are still alive (have
not completed yet) at time t in the optimal scheduling. Further
denote by $\psi^s(t)$ the set of jobs that are alive in SRPTMS+C.

Denote by $t_i^{s,j}$ and $t_i^{f,j}$ the start time and completion time
of task $\delta_i^j$ respectively. Based on the constraints (1d) and (1e),
it follows that


moreover, we 


E 


t f, j _ t °. 


S,J 


= E 


t 3 


/Si(x 3 ) 


have 


E 

/' dpf 3 (t) 

= E 



Jtf 3 


J 


( 11 ) 


( 12 ) 


Substituting Equation (11) into Equation (12) yields the
following formula:


E 


dpf J (t) 

dt 




(13) 


The potential function for a single task is defined as follows:

$$\Psi_i^j(t) = \frac{w_i\, y_i^j(t)}{s_i\big(w_i M / \varepsilon W(t)\big)} \quad (14)$$

Our overall potential function for all the jobs in the cluster is
defined as

$$\Psi(t) = \sum_{i} \sum_{j} \Psi_i^j(t) \quad (15)$$

The potential function is differentiable and it holds that
$\Psi(0) = \Psi(\infty) = 0$ and


E 

dt 

ii 

r ^i - 

M 

M 

m 

dpi (t) 
dt 






Let $C_i^s$ be the completion time for Job $J_i$ under the
SRPTMS+C algorithm. For any time $t \ge r_i$, where $r_i$ is the arrival
time of Job $J_i$, let $A_i(t) = w_i(\min\{C_i^s, t\} - r_i)$ be the
accumulated weighted flowtime of Job $J_i$ at time t. Then we must have

$$E\left[\frac{dA_i(t)}{dt}\right] = w_i \quad \text{for } r_i \le t \le C_i^s \quad (17)$$

$A_i(\infty)$ is just the weighted flowtime of Job $J_i$ in the cluster. Hence, the
total weighted flowtime of all the jobs in the cluster can be formulated
as $A = \sum_i A_i(\infty)$. Further let $A_i = A_i(C_i^s)$ and
$A(t) = \sum_i A_i(t)$. Let $OPT_i(t)$, $OPT_i$, $OPT(t)$ and $OPT$ be defined
similarly for the optimal scheduling policy.
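The accumulated weighted flowtime and its growth rate can be illustrated with a small sketch. Jobs are represented here as hypothetical (weight, arrival, completion) tuples, and the accumulated flowtime is taken to be zero before a job arrives, which is an assumption of this sketch.

```python
def accumulated_weighted_flowtime(weight, arrival, completion, t):
    """Weight times (min(completion, t) - arrival); zero before the job arrives."""
    if t <= arrival:
        return 0.0
    return weight * (min(completion, t) - arrival)

def total_accumulated(jobs, t):
    """Sum of the accumulated weighted flowtimes of all jobs at time t."""
    return sum(accumulated_weighted_flowtime(w, a, c, t) for (w, a, c) in jobs)

def alive_weight(jobs, t):
    """Total weight of jobs that have arrived but not yet completed at time t."""
    return sum(w for (w, a, c) in jobs if a <= t < c)
```

Between arrivals and completions, the total grows at a rate equal to the total weight of the alive jobs, which is the quantity the proof denotes W(t). With jobs `[(2.0, 0.0, 10.0), (1.0, 5.0, 8.0)]`, the total at time 6 is 13 and it is growing at rate 3.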

Similar to the potential-function based analysis in [23],
our goal is to bound the continuous and discrete
increases to $\Psi(t)$ by a function of OPT.

We now focus on the changes made to $\Psi(t)$. It is obvious
that job arrivals make no change to this metric. In addition,
the completion of jobs in the optimal schedule has no effect
on the potential function value. The completion of a job in
SRPTMS+C causes the corresponding term to be removed
from $\Psi(t)$. However, this only decreases the potential, so we
omit it, as our goal is to obtain an upper bound for the
changes made to $\Psi(t)$. As a result, we only need to analyze
the continuous change to $\Psi(t)$.


























• Changes in $\Psi(t)$ due to the optimal scheduling policy, which
is defined as $\Delta^o(t)$:

Let $a_i^{o,j}$ be the number of machines assigned to task $\delta_i^j$ of
Job $J_i$ in the optimal scheduling policy. Based on Equation
(13) and the definition of the potential function, the contribution
made by the optimal scheduling to $\frac{d}{dt}E[\Psi(t)]$ is bounded by
the following formula:


Substituting Equations (24), (25) and (26) into Equation (23)
yields


A °{t) < g-E 


d QPT(t) 


dt 


-E 


dA{t ) 


dt 


(27) 


£ 2 A° j = 




(18) 


Si(u>iM / eW (t)) 

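There are two categories for $a_i^{o,j}$, which are
$a_i^{o,j} \le w_i M/\varepsilon W(t)$ and $a_i^{o,j} > w_i M/\varepsilon W(t)$. For the former case,
applying the monotonic property of the $s_i$ function for all i, we have

$$\frac{w_i\, s_i(a_i^{o,j})}{s_i\big(w_i M/\varepsilon W(t)\big)} \le w_i.$$

For the latter case, applying Proposition 1, we get

$$\frac{w_i\, s_i(a_i^{o,j})}{s_i\big(w_i M/\varepsilon W(t)\big)} \le w_i \cdot \frac{a_i^{o,j}}{w_i M/\varepsilon W(t)} = \varepsilon W(t)\, \frac{a_i^{o,j}}{M}.$$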
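The two cases above can be checked numerically for a concrete speedup function. The sketch below assumes a hypothetical speedup s(x) = sqrt(x), which is non-decreasing and has a non-increasing ratio s(x)/x; these are the properties the case analysis appears to rely on, but Proposition 1 is not restated in this section, so treat them as assumptions. The argument `threshold` stands for the quantity w·M/(εW(t)).

```python
import math

def speedup(x):
    """Hypothetical concave speedup function used only for illustration."""
    return math.sqrt(x)

def ratio_term(w, a, threshold):
    """The bounded quantity: w * s(a) / s(threshold)."""
    return w * speedup(a) / speedup(threshold)

def case_bound(w, a, threshold):
    """Bound from the two cases: w if a <= threshold, else w * a / threshold."""
    return w if a <= threshold else w * a / threshold
```

Since `w * a / threshold` equals εW(t)·a/M when `threshold` is w·M/(εW(t)), the larger of the two case bounds is what the combined inequality uses.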

Combining the two cases, it follows that 

£ 2 A° J (f) < max | Wi,sW(t)^jj 

Oj 

< Wi+£W{t)°^ 


which indicates 

A °(t)
^ E E

. W(t) yy \ '
(19)

( 20 ) 
( 21 ) 

( 22 ) 


Oj 


C \ - 

^ ^ E
E


d OPT(f) 
dt 


For the second term in Equation (23), we have

$$\sum_{i} \sum_{j} a_i^{o,j} \le M$$

and

$$W(t) = \sum_{i \in \psi^s(t)} w_i = \sum_{i \in \psi^s(t)} E\left[\frac{dA_i(t)}{dt}\right] = E\left[\frac{dA(t)}{dt}\right].$$

We proceed to analyze the changes to $\Psi(t)$ made by our
SRPTMS+C scheduling.

• Changes in $\Psi(t)$ due to the SRPTMS+C scheduling policy,
which is defined as $\Delta^s(t)$:

For each task that is alive in SRPTMS+C at time t, if
it has completed its work in the optimal scheduling policy, then
$y_i^j(t)$ is positive. Hence, $y_i^j(t)$ decreases for all tasks
$\delta_i^j \notin \psi_i^o(t)$ that SRPTMS+C processes at time t.

We run our algorithm at speed of 1 + e. Let g J be the 
number of machines assigned to task Sj in SRPTMS+C at 
time t. According to our scheduling policy, we have JE af 3 ^ 
gi(t) for all £ ip s (t). It follows that 

WiS^ 3 ) 


(24) 


(25) 


( 26 )
(23)

For the first term of Equation (23), it follows that


1 + e
Sj


w< f> E A 

5j eV’j(t) 


(28) 


IEeJ, c (t) Wi E Cwi where C is the maximum number of 
copies made for each task in the optimal scheduling algorithm. 
Hence, 

^ E E ^ < § E Wi 

ieV'°(*)n'0 e, (t) <5)e Jf(t) 


The second inequality in the above follows Proposition [I and 
the fact that af 3 ^ 9i{t) = eV+(t) • To bound Inequality (28}, 
we need to bound the second term as follows: 


E *
d OPT(t)
dt


(29) 

(30) 


In addition, we have

$$\sum_{i \in \psi^s(t)} g_i(t) = M \quad (31)$$

Substituting Inequalities (29) and (30) and Equation (31) into
Inequality (28), it holds that


A S (t) < ')W(t)
1 + £
V e 2 )

dt 


dA(t)
d OPT(f)
( 32 )


















































We proceed to complete the final analysis based on the
results derived above. Due to the fact that
$\int_0^\infty E\left[\frac{d\Psi(t)}{dt}\right] dt = E[\Psi(\infty)] - E[\Psi(0)] = 0$, we have



= ( C+ J> + g ) E [O pT ] (33) 


This completes the proof. ■ 



